A Practitioner's Guide for Static Index Pruning

Authors

  • Ismail Sengör Altingövde
  • Rifat Ozcan
  • Özgür Ulusoy
Abstract

We compare the term- and document-centric static index pruning approaches as described in the literature and investigate their sensitivity to the scoring functions employed during the pruning and actual retrieval stages.

1 Static Inverted Index Pruning

Static index pruning permanently removes some information from the index, for the purposes of utilizing the disk space and improving query processing efficiency. In the literature, several approaches have been investigated for static index pruning. Among these methods, the term-centric pruning (referred to as TCP hereafter) proposed in [3] is shown to be very successful at keeping the top-k (k ≤ 30) answers almost unchanged for the queries while significantly reducing the index size. In a nutshell, TCP scores (using Smart's TFIDF function) and sorts the postings of each term in the collection and removes the tail of the list according to some decision criteria. In [1], BM25 is employed instead of the TFIDF function during the pruning and retrieval stages. That study shows that by tuning the pruning algorithm according to the scoring function, it is possible to further boost the performance. On the other hand, the document-centric pruning (referred to as DCP hereafter) introduced in [2] is also shown to give high performance gains. In the DCP approach, only those terms that are most likely to be queried are kept in a document, and the others are discarded. The importance of a term for a document is determined by its contribution to the document's Kullback-Leibler divergence (KLD) from the entire collection. However, the experimental setup in this latter work differs significantly from that of [3]: only the most frequent terms of the collection are pruned, and the resulting (relatively small) index is kept in memory, whereas the remaining unpruned body of the index resides on disk. During retrieval, if a query term is not found in the pruned index in memory, the unpruned index is consulted.
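Concretely, the per-term KLD contribution used as the DCP term score can be written as follows (our paraphrase of the standard formulation; the exact smoothing of the probability estimates follows [2]):

```latex
\mathrm{score}_{\mathrm{KLD}}(t, d) \;=\; P(t \mid d)\,\log\frac{P(t \mid d)}{P(t \mid C)}
```

where P(t|d) is the (possibly smoothed) relative frequency of term t in document d, and P(t|C) is its relative frequency in the entire collection C.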
Thus, it is hard to infer how these two approaches, namely TCP and DCP, compare to each other. Furthermore, given the evidence of recent work on how tuning the scoring function boosts performance [1], it is important to investigate the robustness of these methods for the different scoring functions that are employed during pruning and retrieval, i.e., query execution. In this paper, we provide a performance comparison of the TCP and DCP approaches in terms of retrieval effectiveness at certain pruning levels. Furthermore, for TCP, we investigate how using Kullback-Leibler divergence scores, instead of TFIDF or BM25, during the pruning affects the performance. This may allow applying the TCP method independently of the retrieval function, and thus provide more flexibility for the retrieval system. For the sake of completeness, we also employ other scoring functions instead of KLD while selecting the most promising terms of a document in the DCP approach, and question whether KLD is the best scoring mechanism for this case. Our study sheds light on the sensitivity of the TCP and DCP approaches to the scoring functions and provides an in-depth comparison of these strategies under the same conditions.

2 Experimental Set-Up and Results

Pruning strategies. For both TCP and DCP, we attempt to adopt the best set-up reported in the corresponding works [1, 2, 3]. In particular, for both approaches it has been shown that using an adaptive strategy while deciding what to prune is better than a uniform strategy. Below, we outline these strategies as employed in our study.

• TCP(I, k, ε): For each term t in the index I, the postings in its posting list are first sorted by a function (referred to as the PruneScore function hereafter). Next, the k-th highest score, z_t, is determined and all postings that have scores less than z_t · ε are removed.
Note that, as we are not considering theoretical guarantees in this paper, we use the z_t scores as is, without the shifting operation proposed in [3]. A similar approach has also been taken in [1]. In this algorithm, k is the number of results to be retrieved and ε is the parameter that sets the pruning level. In [3], Smart's TFIDF is used as both the PruneScore and the retrieval function.

• DCP(D, λ): For each document d in the collection D, its terms are sorted by the PruneScore function. Next, the top |d| · λ terms are kept in the document and the rest are discarded. The inverted index is created over these pruned documents. In [2], KLD is employed as the PruneScore function and BM25 is used as the retrieval function.

In this paper, we consider the impact of the following parameters for these strategies.

• PruneScore & retrieval functions: First, for TCP and DCP, we employ each of the scoring functions (TFIDF, BM25 and KLD, as described in [1, 2, 3]) during the pruning stage. When TFIDF or BM25 is used during the pruning, it is intuitive to use the same function in the retrieval stage, too. When KLD is used for pruning, we experiment with each of the TFIDF and BM25 functions, since KLD itself cannot be employed as a query-document similarity function. Note that KLD was essentially proposed to be used with DCP [2], and our study is the first to use it for a term-centric approach. In this way, we attempt to score the postings of a list and prune them in a manner independent of the retrieval function. That is, we prune a posting by considering not only how "good" (or "divergent") that particular term is for that document, but also how that posting's KLD score ranks among the other postings of this term. Clearly, this is different from the DCP approach. The preliminary results, although not conclusive, are encouraging. For the sake of completeness, we also employ the TFIDF and BM25 functions instead of KLD in the term scoring for DCP.
It turns out that KLD may not always be the best choice for the DCP approach, and incorporating the actual retrieval function into the pruning stage can perform better.

• Document lengths: In [1], it is demonstrated that updating the document lengths after the index pruning further improves the performance. In all experiments, we try both approaches, namely fix_dl, where document lengths are not updated after the pruning, and upd_dl, where the lengths are updated. Note that, for the TCP case where BM25 is employed during pruning, in addition to updating the lengths, we employed the other tunings reported to perform best in [1]. In particular, we prune all terms with document frequency > |D|/2 (where |D| is the number of documents in the collection), and do not update the average document length. This case is denoted as "Bm25*" in Figure 1.

[Figure 1: MAP at pruning levels from 10% to 90%. Left panel (KLD & TFIDF functions): Upd_dl_Tfidf_Tfidf, Upd_dl_KLD_Tfidf, Fix_dl_Tfidf_Tfidf, Fix_dl_KLD_Tfidf, Baseline_Tfidf. Right panel (KLD & BM25 functions): Upd_dl_Bm25*_Bm25*, Upd_dl_KLD_Bm25, Fix_dl_Bm25_Bm25, Fix_dl_KLD_Bm25, Baseline_Bm25.]
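The two adaptive strategies above can be sketched as follows. This is a minimal illustration of ours, not the authors' code: `prune_score` stands in for TFIDF, BM25, or KLD, and we assume |d| counts the distinct terms of a document in DCP, which may differ from the exact length notion used in [2].

```python
def tcp_prune(index, k, epsilon, prune_score):
    """TCP(I, k, epsilon): for each term, sort its postings by the
    PruneScore function, find the k-th highest score z_t, and drop
    every posting scoring below z_t * epsilon."""
    pruned = {}
    for term, postings in index.items():
        scored = sorted(((prune_score(term, doc), doc) for doc in postings),
                        reverse=True)
        if len(scored) <= k:            # short lists are kept whole
            pruned[term] = [doc for _, doc in scored]
            continue
        z_t = scored[k - 1][0]          # k-th highest score
        threshold = z_t * epsilon
        pruned[term] = [doc for s, doc in scored if s >= threshold]
    return pruned


def dcp_prune(docs, lam, prune_score):
    """DCP(D, lambda): keep the top |d| * lambda terms of each document
    (|d| taken here as the number of distinct terms, an assumption),
    then build the inverted index over the pruned documents."""
    index = {}
    for doc_id, terms in docs.items():
        distinct = set(terms)
        keep = sorted(distinct, key=lambda t: prune_score(t, doc_id),
                      reverse=True)[:max(1, int(len(distinct) * lam))]
        for t in keep:
            index.setdefault(t, []).append(doc_id)
    return index
```

Note how the threshold is adaptive in both cases: TCP's cut-off z_t · ε is derived per posting list, and DCP's cut-off |d| · λ scales with each document's size, matching the adaptive (rather than uniform) set-ups described above.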


Publication date: 2009